Sequence | 1000 Genomes

Can I get image files for any of the 1000 Genomes sequencing runs?

Answer:

The image files produced by next-generation sequencing runs are very large. The production centres do not submit them to the archives, so they are not available to downstream users.

Is the data for the pilot study still available?

Answer:

All the pilot data remain on our ftp site under the pilot_data directory (EBI/NCBI). The variants discussed in the pilot paper can also be found on the ftp site (EBI/NCBI).

Please note these data are all mapped to the NCBI36 human reference.

What is the difference between the sequence.index and the analysis.sequence.index?

Answer:

The sequence.index file contains a list of all the sequence data produced by the project, pointers to the file locations on the ftp site and also all the meta data associated with each sequencing run.

For the phase 3 analysis the consortium has decided to only use Illumina platform sequence data with reads of 70 base pairs or longer. The analysis.sequence.index file contains only the active runs which match this criterion. There are withdrawn runs in this index. These runs are withdrawn for one of the following reasons:

* They have insufficient raw sequence to meet our 3x non-duplicated aligned coverage criteria for low coverage alignments.
* After the alignment has been run, they have failed our post-alignment quality controls for short indels.
* They show evidence of contamination.
* They do not meet our coverage criteria.

Since the alignment release based on 20120522, we have only released alignments based on the analysis.sequence.index.
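The selection rule described above (Illumina platform, reads of 70 base pairs or longer) can be sketched as a filter over index rows. This is only an illustration, not the consortium's actual procedure: it assumes read length can be approximated as BASE_COUNT divided by READ_COUNT, that the platform value is spelled ILLUMINA, and the rows and run names shown are hypothetical.

```python
def mean_read_length(row):
    """Approximate read length as BASE_COUNT / READ_COUNT (an assumption)."""
    reads = int(row["READ_COUNT"])
    return int(row["BASE_COUNT"]) / reads if reads else 0.0


def meets_phase3_criteria(row, min_length=70):
    """True for Illumina rows whose approximate read length is >= min_length."""
    return (
        row["INSTRUMENT_PLATFORM"].upper() == "ILLUMINA"
        and mean_read_length(row) >= min_length
    )


# Hypothetical rows, reduced to the columns used by the filter.
rows = [
    {"RUN_ID": "RUN_A", "INSTRUMENT_PLATFORM": "ILLUMINA",
     "READ_COUNT": "1000", "BASE_COUNT": "100000"},
    {"RUN_ID": "RUN_B", "INSTRUMENT_PLATFORM": "ILLUMINA",
     "READ_COUNT": "1000", "BASE_COUNT": "36000"},
    {"RUN_ID": "RUN_C", "INSTRUMENT_PLATFORM": "LS454",
     "READ_COUNT": "1000", "BASE_COUNT": "400000"},
]

selected = [row["RUN_ID"] for row in rows if meets_phase3_criteria(row)]
print(selected)
```

Only RUN_A passes here: RUN_B is too short (36 bases per read on average) and RUN_C is not an Illumina run.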

What format are your sequence files?

Answer:

Our sequence files are distributed in FASTQ format.

We use Sanger style phred scaled quality encoding.

The files are all gzip compressed and the format looks like this, with a 4 line repeating pattern:

@ERR059938.60 HS9_6783:8:2304:19291:186369#7/2
GTCTCCGGGGGCTGGGGGAACCAGGGGTTCCCACCAACCACCCTCACTCAGCCTTTTCCCTCCAGGCATCTCTGGGAAAGGACATGGGGCTGGTGCGGGG
+
7?CIGJB:D:-F7LA:GI9FDHBIJ7,GHGJBKHNI7IN,EML8IFIA7HN7J6,L6686LCJE?JKA6G7AK6GK5C6@6IK+++?5+=<;227*6054
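As a minimal sketch of reading this format, the Python below walks the 4 line pattern in a gzip compressed stream and converts each quality character to a Phred score using the Sanger offset of 33. The function name read_fastq and the in-memory test record are illustrative only; a real file would be opened with gzip.open(path, "rt") instead.

```python
import gzip
import io


def read_fastq(handle):
    """Yield (name, sequence, scores) from a text handle of 4-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or an empty line) stops the iteration
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        quality = handle.readline().rstrip("\n")
        # Sanger encoding: Phred score = ASCII code of the character minus 33
        scores = [ord(char) - 33 for char in quality]
        yield header[1:], sequence, scores


# Hypothetical one-record file, compressed in memory for the demonstration.
record = "@read1 example/1\nGTCA\n+\n7?CI\n"
compressed = gzip.compress(record.encode("ascii"))

with gzip.open(io.BytesIO(compressed), "rt") as handle:
    records = list(read_fastq(handle))

print(records)
```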

What is a sequence index file?

Answer:

We describe our sequence meta data in sequence index files. The index for data from the 1000 Genomes Project can be found in the 1000 Genomes data collection directory. Additional indices are present for data in other data collections. Our old index files, which describe the data used in the main project, can be found in the historical_data directory.

Sequence index files are tab delimited files and frequently contain these columns:

Column	Title	Description
1	FASTQ_FILE	path to fastq file on ftp site or ENA ftp site
2	MD5	md5sum of file
3	RUN_ID	SRA/ERA run accession
4	STUDY_ID	SRA/ERA study accession
5	STUDY_NAME	Name of study
6	CENTER_NAME	Submission centre name
7	SUBMISSION_ID	SRA/ERA submission accession
8	SUBMISSION_DATE	Date sequence submitted, YYYY-MM-DD
9	SAMPLE_ID	SRA/ERA sample accession
10	SAMPLE_NAME	Sample name
11	POPULATION	Sample population, this is a 3 letter code as defined in README_populations.md
12	EXPERIMENT_ID	Experiment accession
13	INSTRUMENT_PLATFORM	Type of sequencing machine
14	INSTRUMENT_MODEL	Model of sequencing machine
15	LIBRARY_NAME	Library name
16	RUN_NAME	Name of machine run
17	RUN_BLOCK_NAME	Name of machine run sector (This is no longer recorded so this column is entirely null, it was left in so as not to disrupt existing sequence index parsers)
18	INSERT_SIZE	Submitter specified insert size
19	LIBRARY_LAYOUT	Library layout, this can be either PAIRED or SINGLE
20	PAIRED_FASTQ	Name of mate pair file if exists (Runs with failed mates will have a library layout of PAIRED but no paired fastq file)
21	WITHDRAWN	0/1 to indicate if the file has been withdrawn, only present if a file has been withdrawn
22	WITHDRAWN_DATE	This is generally the date the file is generated on
23	COMMENT	comment about reason for withdrawal
24	READ_COUNT	read count for the file
25	BASE_COUNT	basepair count for the file
26	ANALYSIS_GROUP	the analysis group of the sequence, this reflects sequencing strategy. For 1000 Genomes Project data, this includes low coverage, high coverage, exon targeted and exome to reflect the two non low coverage pilot sequencing strategies and the two main project sequencing strategies used by the 1000 Genomes Project.
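A tab delimited index of this kind can be read with Python's csv module. The sketch below assumes the first non-comment row holds the column titles (optionally prefixed with "#") and that rows starting with "##" are comments; check these assumptions against the file you download. The sample rows, paths and run names are hypothetical and reduced to four columns.

```python
import csv
import io


def read_index(handle):
    """Yield one dict per data row of a tab-delimited index, keyed by column title."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = None
    for fields in reader:
        if not fields or fields[0].startswith("##"):
            continue  # skip blank rows and assumed comment rows
        if header is None:
            # Assumption: the first remaining row is the header line.
            header = [name.lstrip("#") for name in fields]
            continue
        yield dict(zip(header, fields))


# Hypothetical index content with a reduced set of columns.
sample = (
    "FASTQ_FILE\tRUN_ID\tLIBRARY_LAYOUT\tWITHDRAWN\n"
    "example/RUN_A_1.fastq.gz\tRUN_A\tPAIRED\t0\n"
    "example/RUN_B.fastq.gz\tRUN_B\tSINGLE\t1\n"
)

# Keep only rows that are not flagged as withdrawn.
active = [
    row["RUN_ID"]
    for row in read_index(io.StringIO(sample))
    if row.get("WITHDRAWN") != "1"
]
print(active)
```

Keying rows by column title rather than by position keeps the code working if an index carries a different set of columns.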

Where are your sequence files located?

Answer:

Our sequence files are distributed in fastq format and can be found under the data directory of the ftp site. There is a directory per individual, which contains all the sequence data we have for that individual as well as all the alignment data we have.

We also distribute meta data for all our sequencing runs in a sequence.index file which is described in a README on the ftp site.

Why is there more than one set of fastq files associated with an individual?

Answer:

Many of our individuals have multiple fastq files. This is because many of our individuals were sequenced using more than one run of a sequencing machine.

Each set of files named like ERR001268_1.filt.fastq.gz, ERR001268_2.filt.fastq.gz and ERR001268.filt.fastq.gz represents all the sequence from a sequencing run.

When an individual has many files with different run accessions (e.g. ERR001268), this means it was sequenced multiple times. This can either be for the same experiment (some centres used multiplexing to have better control over their coverage levels for the low coverage sequencing) or because it was sequenced using different protocols or on different platforms.

For a full description of the sequencing conducted for the project, please look at our sequence.index file.
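To see which files belong to the same sequencing run, file names following the pattern above can be grouped by run accession. This is a sketch only: the regular expression is an assumption based on the three example names, and ERR001269 is a hypothetical second run added for illustration.

```python
import re
from collections import defaultdict

# Assumed naming pattern: <run accession>[_1|_2].filt.fastq.gz
PATTERN = re.compile(r"^(?P<run>[A-Z]{3}\d+)(?:_(?P<part>[12]))?\.filt\.fastq\.gz$")


def group_by_run(filenames):
    """Map each run accession to its files, keyed by suffix ('1', '2' or 'unsuffixed')."""
    runs = defaultdict(dict)
    for name in filenames:
        match = PATTERN.match(name)
        if match is None:
            continue  # ignore names that do not follow the assumed pattern
        part = match.group("part") or "unsuffixed"
        runs[match.group("run")][part] = name
    return dict(runs)


files = [
    "ERR001268_1.filt.fastq.gz",
    "ERR001268_2.filt.fastq.gz",
    "ERR001268.filt.fastq.gz",
    "ERR001269.filt.fastq.gz",  # hypothetical second run
]

groups = group_by_run(files)
for run, parts in sorted(groups.items()):
    print(run, sorted(parts))
```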

What percentage of the genome is assayable?

Answer:

The 1000 Genomes Project created what they defined as accessibility masks for the pilot phase, phase 1 and phase 3 of the Project.

Pilot

The pilot mask showed that only 85% of the genome is accessible to accurate analysis with the short read technology used by the 1000 Genomes pilot project. The remaining 15% is either repeats or segmental duplications. There is more information about the pilot mask in README.callability_masks.

Phase 1

For the phase 1 analysis, using the pilot callability criteria, 94% of the genome was accessible. A stricter mask was also created for the phase 1 project to be used for population genetics analysis; this mask used a narrower band of coverage criteria and also required that less than 0.1% of reads have a mapping quality of 0 and that the average mapping quality be 56 or higher. These criteria led to 72.2% of the genome being accessible to accurate analysis with the short read technology used at that time by the 1000 Genomes Project. Further information is in section 10.4 of the supplementary material from the phase 1 publication.

Phase 3

In phase 3, using the pilot criteria 95.9% of the genome was found to be accessible. For the stricter mask created during phase 3, 76.9% was found to be accessible. A detailed description of the accessibility masks created during phase 3, the final phase of the Project, can be found in section 9.2 of the supplementary material for the main publication. The percentages quoted are for non-N bases.
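The effect of quoting percentages for non-N bases can be shown with a small worked example. The function below is a sketch, not the Project's own calculation: it takes a reference sequence and a per-base list of accessible/inaccessible flags (both hypothetical inputs here) and excludes N bases from the denominator.

```python
def percent_accessible(reference, accessible):
    """Percentage of non-N reference bases that are flagged as accessible."""
    non_n = 0
    passed = 0
    for base, is_accessible in zip(reference, accessible):
        if base.upper() == "N":
            continue  # N bases are excluded from the denominator
        non_n += 1
        if is_accessible:
            passed += 1
    return 100.0 * passed / non_n if non_n else 0.0


# Hypothetical 10-base reference: 2 N bases, 8 non-N bases, 7 of them accessible.
reference = "NNACGTACGT"
accessible = [False, False, True, True, True, False, True, True, True, True]

print(percent_accessible(reference, accessible))
```

Counting the two N bases in the denominator would give 70% instead of 87.5%, which is why the basis of the percentage matters when comparing figures.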

IGSR: The International Genome Sample Resource

Supporting open human variation data
